Fix return path of NodePort traffic. #130
Conversation
This change may become incompatible with other features, such as:
- adding Pod IPs to an NLB/ALB target group and using the NLB/ALB to send traffic directly to the Pod IP
- Pods using different subnets and security groups than eth0.
@liwenwu-amazon Can you expand on your comments? I thought all this PR impacts is traffic directed at node ports via eth0, so it's not clear to me why this change would be incompatible with the NLB/ALB routing directly to the pod IP, or with pods that have different subnets or security groups than eth0. It doesn't sound like either of those future use cases involves traffic being sent to a node port?
@liwenwu-amazon I think @lxpollitt is right; this PR only affects traffic that arrives at eth0 and then returns from a pod. I think that: …
@fasaxc @lxpollitt, thanks for the PR. Can you add more detail on how you verified the PR, such as: …
@liwenwu-amazon the rules are static.
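They amount to roughly the following sketch, reconstructed from the description in the rest of this comment (the mangle chain and the 0x80 mark are as described below; the `eni+` pod-veth interface prefix and the exact flags are assumptions):

```
# Mark connections whose packets arrive on eth0 addressed to the host itself
# (this is what NodePort traffic looks like before the kube-proxy DNAT).
iptables -t mangle -A PREROUTING -i eth0 -m addrtype --dst-type LOCAL \
  -j CONNMARK --set-mark 0x80/0x80

# Restore the connection mark onto packets coming back from local pods,
# so the return path can be steered by a routing rule.
iptables -t mangle -A PREROUTING -i eni+ -j CONNMARK --restore-mark --mask 0x80
```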
Plus there is a routing rule that matches on the 0x80 mark bit and uses the main routing table (along the lines of `ip rule add fwmark 0x80/0x80 lookup main`).

The mangle PREROUTING chain is executed just before the nat PREROUTING chain, before the routing decision. The first rule only takes effect if the incoming packet comes in via eth0. The second rule matches only packets that are leaving local pods, and it restores the mark bit. That means the mark bit will only be set on packets that are part of a connection that started by coming in over eth0 to a host IP. Hence, only connections that arrived on eth0 and were NATted to a pod will have the mark bit set, which should only catch NodePort traffic.

This PR doesn't do anything for non-eth0 secondary IPs right now; those packets should be processed as before. I wasn't sure what the desired behaviour of such IPs was. The PR could be extended to use a mark value per ENI, but that would require quite a bit of additional work. If a user is already ENI-aware, presumably they'll only be accessing pods attached to that ENI, so the routing should just work in that case?

We use similar rules in Calico, and the impact of 1-2 rules of that kind should be negligible. The non-NodePort traffic should fail the first match criterion on each rule, which amounts to a 4-byte string comparison.

For troubleshooting: …
@fasaxc I am testing the change using the https://kubernetes.io/docs/tasks/access-application-cluster/connecting-frontend-backend/ example on a cluster of 1 node. I am not able to see …
@liwenwu-amazon That's odd; I'll try to reproduce here. Please can you give me details of your setup?
I was able to reproduce locally. Previously I was testing on a kops cluster, but after switching to EKS I hit issues with the fix. I'll keep digging to see if I can get to the bottom of it.
It looks like the difference between kops and EKS is that EKS has strict RPF filtering on eth0, whereas it was disabled on the kops cluster. I find that the fix does seem to work if I set the RPF check on eth0 to "loose" mode.
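For reference, loose mode is `rp_filter` value 2, so the command was presumably something like this sketch:

```
# Switch eth0's reverse-path filter from strict (1) to loose (2) mode.
sysctl -w net.ipv4.conf.eth0.rp_filter=2
```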
@fasaxc thanks for the RPF finding. Do you know why …
@fasaxc, can you explain why changing eth0 to "loose" mode makes it work? In other words, if it is not set (as is the case today in EKS), which particular packet gets dropped and makes the instance …
@liwenwu-amazon With "strict" RPF, what happens is: …
With "loose" RPF, the check passes because the kernel accepts any return route rather than requiring the return route to leave via the same interface.

In my testing I used two nodes and two "frontend" pods, one on each host; in that scenario, I do now understand why both nodes get marked as healthy. This happens because the DNAT on each host chooses a frontend pod at random. In addition, after an RPF failure, the kernel seems to drop the NAT state for the connection (so retried packets may go to a different DNAT rule). That means that you get this scenario: …

In manual testing, I was seeing the SYN retries with tcpdump, and the latency of a connection that went through this retry was 1s more than normal. With an ELB, it was showing the nodes as healthy, but I was seeing random latency spikes through the ELB.
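A capture along these lines shows the retries (a sketch; the NodePort 30080 is a placeholder, not a value from this thread):

```
# Show only pure SYNs (no ACK) to the NodePort, so retransmitted
# connection attempts stand out.
tcpdump -ni eth0 'tcp port 30080 and tcp[tcpflags] & tcp-syn != 0 and tcp[tcpflags] & tcp-ack == 0'
```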
I've updated the PR to set the "loose" setting on eth0. PTAL.
@fasaxc I am a bit confused by your notes on why the RPF check failed. …

Here is the route table for the FrontEnd-Pod: …

From `ip rule show`, any packet destined to the FrontEnd-Pod needs to use the main routing table: …

Here is the incoming ELB health-check flow: …

My understanding of the RPF check is that it will drop the packet if the packet is received on the interface that should be the output interface for the IP-DA. I don't quite follow why it fails RPF in this case.
@liwenwu-amazon strict RPF drops a packet if the routing lookup for the reverse path yields a different interface than the one it came in on. So, when it is NAT'ed to the pod IP, the reverse-path lookup is done with the NAT'ed addresses, which presumably would forward out a different interface than eth0.

A loose RPF check just verifies that it can route the reverse packet, but does not require that it go out the same interface it came in on, so it would be successful in this case.
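If you want to see what the strict check computes, you can ask the kernel for the reverse route directly; this is a diagnostic sketch with placeholder addresses, not a command from the thread:

```
# Which route would the kernel pick back to the packet's source (10.0.1.23),
# for a packet that arrived on eth0 addressed to the pod (10.0.2.45)?
ip route get 10.0.1.23 from 10.0.2.45 iif eth0
```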
@spikecurtis I thought routing-table selection was done before DNAT, so it would use the main routing table to perform the RPF check. Do you see any stats (e.g. netfilter stats) to support your finding? Also, do you have a kernel code trace showing that the kernel forwarding code performs ip rule selection again during the RPF check? Thanks.
@liwenwu-amazon the routing decision, including table selection, has to happen after DNAT; otherwise the packet would not get forwarded correctly to the new destination. Reverse-path filtering also happens during the routing decision. You can see the code that validates the reverse path in the kernel sources: https://elixir.bootlin.com/linux/v4.17.6/source/net/ipv4/fib_frontend.c#L325 — it performs a full FIB lookup (including policy rule selection) on the reversed flow.
@spikecurtis, @fasaxc thanks for the reference. Interesting to know that in this case it uses … For monitoring/debugging/auditing purposes, do you know of any statistics we can use for this change? In other words, shall we add anything additional to …
The only diagnostic for the RPF check I'm aware of is the following sysctl:
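From the description that follows, this is presumably `log_martians` (shown here as an assumption, not the verbatim snippet):

```
# Log "martian" packets, which includes packets dropped by the RPF check.
sysctl -w net.ipv4.conf.all.log_martians=1
```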
It causes the kernel to log all packets that are dropped by the RPF check. I haven't found any kernel stats for that. For the new iptables rules that we added, I suggest including …

One slight correction to the health-check flow: I think the SNAT happens after the routing decision, so …
@fasaxc Thanks again for the notes. The only concern I have now is about "disabling the RPF check": RPF is a security feature that can help limit malicious traffic and prevent IP-address spoofing. How about adding a configuration knob that …
OK, how about …
@fasaxc how about using … if …
I think it's very important that NodePorts work correctly with the default settings, and would therefore advocate that the default value of this config flag be to enable the changes in this PR. NodePorts are a part of standard Kubernetes networking, and if they don't work by default, EKS and self-managed clusters that use the CNI plugin might not be considered conformant with the Kubernetes API spec.

While NLB/ALB with direct pod access is a good technology, the reality is that it will not be adopted overnight, nor will it be adopted by 100% of users. Having trouble getting NodePorts working could end up being a source of adoption friction for our users, and I think we should avoid that as much as possible.

I want to emphasize that this change does not disable RPF checks; it simply switches them from strict to loose. Loose mode is the recommended mode "if using asymmetric routing or other complicated routing", according to the Linux man pages: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

It's also worth pointing out that, unless the user has explicitly disabled it, the AWS network will already check for the kind of IP spoofing that RPF is designed to prevent.
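For reference, the three rp_filter modes as summarized from the kernel documentation linked above:

```
# net.ipv4.conf.<interface>.rp_filter
#   0 - no source validation
#   1 - strict mode (RFC 3704): drop the packet if the reverse path
#       would not route out of the interface it arrived on
#   2 - loose mode: drop the packet only if the source address is
#       not reachable via ANY interface
```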
@liwenwu-amazon I agree with @spikecurtis; I think it should default to making NodePorts work, or users will get tripped up. WDYT?
I've updated the patch along those lines.
@fasaxc can you add some unit tests too? Thanks.
@fasaxc @lxpollitt @spikecurtis I have the following concerns with node-port as the default: …
Here is a comparison between node-port and using the Pod VPC IP as the target: …
@liwenwu-amazon I think we're all agreed that NLB/ALB/ELB -> Pod VPC IP is a good solution. The reason for supporting NodePorts too is that they're a standard Kubernetes feature. The user doesn't have to be using an NLB/ALB/ELB to use a NodePort.
I've added some unit tests to the PR.
For addressing CR comments, is it possible to amend one of the previous commits? That way we will not have too many commits by the time the PR merges.
@liwenwu-amazon I've updated the PR to include the diagnostics you requested.
@fasaxc @spikecurtis, can you help come up with an e2e test which we can manually verify for now: …

Thanks.
@fasaxc, I am getting the following (with the default settings), which does NOT seem right: …
Add iptables and routing rules that
- connmark traffic that arrives at the host over eth0
- restore the mark when the traffic leaves a pod veth
- force marked traffic to use the main routing table so that it exits via eth0.

Configure the eth0 RPF check for "loose" filtering to prevent NodePort traffic from being blocked due to an incorrect reverse-path lookup in the kernel. (The kernel is unable to undo the NAT as part of its RPF check, so it calculates the incorrect reverse route.)

Add diagnostics for env var configuration and sysctls.

Fixes aws#75
I've updated the PR to print the active state of the environment variables instead of the raw text.

For an E2E test, I think this should be sufficient: …
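As an illustrative sketch of that kind of check (the node IP, NodePort, and repetition count are placeholders, not values from this thread):

```
# From a host in the VPC but outside the cluster, hit the NodePort repeatedly.
# Without the fix, requests that DNAT to a pod on a secondary ENI either fail
# or show ~1s latency spikes from SYN retries; with the fix they all succeed.
NODE_IP=10.0.1.23     # placeholder: any worker node's primary IP
NODE_PORT=30080       # placeholder: the service's NodePort
for i in $(seq 1 20); do
  curl -s -o /dev/null -w '%{time_total}s\n' "http://$NODE_IP:$NODE_PORT/"
done
```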
@fasaxc thanks for the E2E test case! Here are the detailed steps I am going to use to test this: …
@liwenwu-amazon k8s is allow-by-default; this page explains how to install a default-deny policy (in the "Default Policies" section): https://kubernetes.io/docs/concepts/services-networking/network-policies/

Note: node ports and policy don't interact as you might hope. kube-proxy SNATs the traffic to ensure that return traffic goes via the ingress node, so Kubernetes policy actually sees the ingress node as the source of the traffic. Calico supports pre-NAT policy that allows for securing NodePorts, but that is not exposed by the Kubernetes API; it has to be applied to the host instead of the workload, since that is where the NAT happens.
@fasaxc thanks for the input. Here is the YAML for the policy part; please let me know if it is correct and if you have any comments/suggestions: …
Since this issue affects node ports, it's more important to test that Network Policy works correctly when restricting access to the frontend, rather than the backend. So you want a policy that allows ingress to the frontend from the VPC CIDR, such as the sketch below.
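A sketch of such an allow policy; the policy name, pod labels, and VPC CIDR here are illustrative assumptions, not values from the thread:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: frontend-allow-from-vpc   # assumed name
spec:
  podSelector:
    matchLabels:
      app: frontend               # assumed frontend pod label
  ingress:
    - from:
        - ipBlock:
            cidr: 10.0.0.0/16     # assumed VPC CIDR
```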
Such a policy would allow connections from other hosts in the VPC (including K8s nodes, which is the important part). Then, you should be able to access the service by connecting to the node port. If you delete the allow policy and replace it with a deny along these lines:
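Again an illustrative sketch; selecting the pods for Ingress while listing no ingress rules denies all inbound traffic to them:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: frontend-default-deny    # assumed name
spec:
  podSelector:
    matchLabels:
      app: frontend              # assumed frontend pod label
  policyTypes:
    - Ingress
  # no ingress rules listed => all ingress to the selected pods is denied
```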
then you should see that some requests are denied if they are sent to a pod on a remote host.

@fasaxc suggested that you may want to create two different services and restrict the pods for each service to different nodes. This would make the test more deterministic, so that we can separately test nodePort -> local pod vs nodePort -> remote pod.
@spikecurtis Thanks for the suggestions. Since the …
Yeah, if there is only one pod and you hit the node port on each node, you will exercise both nodePort -> local pod and nodePort -> remote pod. However, with only one pod, how do you ensure that the pod is on a secondary ENI? The way I thought you were going to do it is to create more pods than can fit on a single ENI.
@spikecurtis Good catch! Yes, I will need to deploy some pods first to make sure the IPs of the primary ENI are all used before deploying the frontend Pod and service.
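A sketch of that filler step; the deployment name, image, and replica count are placeholders (the count needed depends on the instance type's IPs-per-ENI limit):

```
# Consume the primary ENI's secondary IPs with placeholder pods first.
kubectl create deployment ip-filler --image=k8s.gcr.io/pause:3.1
kubectl scale deployment ip-filler --replicas=20
```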
@fasaxc @spikecurtis Thank you very much for this PR!
Issue #, if available:
Fixes #75
Description of changes:
Add iptables and routing rules that
- connmark traffic that arrives at the host over eth0
- restore the mark when the traffic leaves a pod veth
- force marked traffic to use the main routing table so that it exits via eth0.
Testing performed:
- … (with `IptablesMangleAllowAction=RETURN`)
- … (`ip rule` …)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.