NodePort not working properly for pods on secondary ENIs #75
A solution to this issue would be to mark NodePort traffic and use conntrack to force the reverse path through the primary ENI:
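A rough sketch of that approach, assuming the default NodePort range (30000-32767), a 0x80 mark value, and `eni+` as the pod-veth prefix (all illustrative, not rules taken from this thread):

```bash
# Mark connections for node-port traffic that arrives on the primary ENI
iptables -t mangle -A PREROUTING -i eth0 -p tcp --dport 30000:32767 \
  -j CONNMARK --set-mark 0x80/0x80

# Copy the connection mark back onto reply packets coming out of the pod veths
iptables -t mangle -A PREROUTING -i eni+ -j CONNMARK --restore-mark --mask 0x80

# Route marked reply traffic via the main table so it exits through eth0
ip rule add fwmark 0x80/0x80 lookup main priority 1024
```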
@lbernail can I reproduce this by:
right? thanks
@liwenwu-amazon yes, that would work, but you need to access the service using nodeip:nodeport from outside the node: if I recall correctly, accessing a NodePort from a pod will redirect you to the standard service endpoints and bypass the NodePort iptables rule.
@lbernail, we seem to only be able to reproduce this problem when we disable SNAT via the external SNAT option (#120). Can you confirm this is the same case for you? With SNAT, all node-port traffic gets sent and received on eth0, so it works. When external SNAT (#120) is enabled, incoming traffic arrives on eth0 but outgoing traffic is sent via eth1, which breaks Linux connection tracking...
@liwenwu-amazon if the traffic is coming from outside the VPC CIDR it will be SNATed, so yes, that will solve the issue. However, if the request comes from within the VPC CIDR, the reply will be routed using the additional ENI (if the target pod is not on the primary ENI) and you will still hit the reverse-path issue.
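For context, the external SNAT behaviour discussed here is controlled by the `AWS_VPC_K8S_CNI_EXTERNALSNAT` environment variable on the CNI daemonset; one way to toggle it (daemonset name and namespace assumed to be the defaults) is:

```bash
# Let an external device (e.g. a NAT gateway) handle SNAT instead of the CNI
kubectl -n kube-system set env daemonset/aws-node AWS_VPC_K8S_CNI_EXTERNALSNAT=true
```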
I think this issue impacts anyone who tries to use NodePorts from within the same VPC too, which seems fairly mainline. Since this is impacting my team, I've started working on a fix based on the above suggestion to use the connmark with a couple of changes:
Here's my current set of rules:
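The original rule listing is not preserved above; a sketch along the lines described in this thread (mark value, veth prefix and rule priority are assumptions) would be:

```bash
# Mark connections that arrive on eth0 and are addressed to an IP assigned to eth0
iptables -t mangle -A PREROUTING -i eth0 \
  -m addrtype --dst-type LOCAL --limit-iface-in \
  -j CONNMARK --set-mark 0x80/0x80

# Restore the mark on packets coming back through the pod veths
iptables -t mangle -A PREROUTING -i eni+ -j CONNMARK --restore-mark --mask 0x80

# Marked traffic uses the main routing table, so replies leave via eth0
ip rule add fwmark 0x80/0x80 lookup main priority 1024
```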
@fasaxc: you do it this way to improve performance?
@lbernail The change to use the addrtype? That does a couple of things:
which also has the addrtype clause; this makes sure that we don't match traffic that isn't heading to an IP assigned to the host (i.e. traffic that is going directly to a pod but happens to be going to a port in the node port range)
When you combine the two rules, you get:
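Presumably a single CONNMARK rule combining the interface match with the addrtype clause, along the lines of the first rule in the sketch above (mark value illustrative):

```bash
# Only connections that arrive on eth0 and target an address local to that
# interface are marked, so traffic addressed directly to a pod IP is ignored
iptables -t mangle -A PREROUTING -i eth0 \
  -m addrtype --dst-type LOCAL --limit-iface-in \
  -j CONNMARK --set-mark 0x80/0x80
```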
Add iptables and routing rules that:
- connmark traffic that arrives at the host over eth0
- restore the mark when the traffic leaves a pod veth
- force marked traffic to use the main routing table so that it exits via eth0

Fixes aws#75
Today, aws-vpc-cni assigns a native VPC IP address to each Pod, so it works directly with AWS ALB and NLB without any port mapping. In other words, Pod IPs can be added to ALB and NLB target groups. This improves network performance, operation, and debugging, and removes the need to manipulate iptables or IPVS tables. There is a work-in-progress PR ("Support routing directly to pods") in aws-alb-ingress-controller. We are also actively working on an NLB controller, so that traffic towards a service VIP backed by an NLB can be sent directly to the Pod IP.
👍
Add iptables and routing rules that:
- connmark traffic that arrives at the host over eth0
- restore the mark when the traffic leaves a pod veth
- force marked traffic to use the main routing table so that it exits via eth0

Configure the eth0 RPF check for "loose" filtering to prevent NodePort traffic from being blocked due to an incorrect reverse path lookup in the kernel. (The kernel is unable to undo the NAT as part of its RPF check, so it calculates the incorrect reverse route.) Add diagnostics for env var configuration and sysctls.

Fixes aws#75
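"Loose" reverse-path filtering corresponds to rp_filter mode 2; a sketch of checking and setting it on the primary ENI (interface name assumed to be eth0):

```bash
# 0 = no source validation, 1 = strict RPF, 2 = loose RPF
sysctl net.ipv4.conf.eth0.rp_filter
sysctl -w net.ipv4.conf.eth0.rp_filter=2

# The effective value is the maximum of the "all" and per-interface settings
sysctl net.ipv4.conf.all.rp_filter
```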
Looks like this issue is still there if Calico is used for network policy, as Calico's own rules are evaluated before the ones that solve the issue; see my comment here: #231 (comment)
When using NodePort services (with or without a load balancer), if iptables redirects traffic to a local pod whose IP is on a secondary interface, the traffic is dropped by the reverse path filter.
Everything seems to work OK because if the first SYN is dropped the client retries, and the retry will (probably) be sent to another host (or to a pod on the primary interface); however, queries load-balanced to the local pod take much longer.
This can be seen by logging martian packets: when traffic is sent to a local pod on a secondary interface, it is dropped.
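Martian logging can be enabled via sysctl; the drops then show up in the kernel log while the NodePort is being exercised:

```bash
# Log packets that fail the reverse path check ("martians")
sysctl -w net.ipv4.conf.all.log_martians=1
sysctl -w net.ipv4.conf.eth0.log_martians=1

# Watch for "martian source" messages while sending requests to the node port
dmesg -w | grep -i martian
```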
The reason is the following: incoming traffic arrives on eth0 and is DNATed to the pod IP, but because that IP belongs to a secondary ENI, the reverse route lookup for it points out of the secondary ENI rather than eth0, so the reverse path filter on eth0 rejects the packet as a martian.
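The asymmetry is visible in the policy routing configuration the CNI installs; the table number below depends on which ENI the pod's IP was allocated from and is only an example:

```bash
# Per-pod rules send traffic *from* secondary-ENI pod IPs to per-ENI route tables
ip rule show

# The per-ENI table's default route points at the secondary ENI (e.g. eth1),
# not eth0, so the reverse route computed for the DNATed destination fails the
# rp_filter check on eth0
ip route show table 2
```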
To trigger the issue consistently:
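The original steps are not shown above; a plausible minimal reproduction, assuming a node whose CNI has attached at least one secondary ENI (service name, image and replica count are arbitrary), is roughly:

```bash
# Run enough replicas that some pods get IPs from a secondary ENI
kubectl create deployment echo --image=k8s.gcr.io/echoserver:1.10
kubectl scale deployment echo --replicas=20
kubectl expose deployment echo --port=8080 --type=NodePort

# From another instance in the same VPC (so the traffic is not SNATed),
# hit the node port; requests balanced to secondary-ENI pods stall on the first SYN
NODE_IP=<node-private-ip>   # placeholder for the node's private IP
NODE_PORT=$(kubectl get svc echo -o jsonpath='{.spec.ports[0].nodePort}')
curl -m 2 "http://${NODE_IP}:${NODE_PORT}/"
```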